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Abstract  Topic  modeling  is  a  generalization  of  clustering  that  posits  that  observations 
(words  in  a  document)  are  generated  by  multiple  latent  factors  (topics),  as  opposed  to 
just  one.  The  increased  representational  power  comes  at  the  cost  of  a  more  challenging 
unsupervised  learning  problem  for  estimating  the  topic-word  distributions  when  only 
words  are  observed,  and  the  topics  are  hidden.  This  work  provides  a  simple  and  efficient 
learning  procedure  that  is  guaranteed  to  recover  the  parameters  for  a  wide  class  of 
multi-view  models  and  topic  models,  including  latent  Dirichlet  allocation  (LDA). 
For  LDA,  the  procedure  correctly  recovers  both  the  topic-word  distributions  and  the 
parameters  of  the  Dirichlet  prior  over  the  topic  mixtures,  using  only  trigram  statistics 
(i.e.,  third  order  moments,  which  may  be  estimated  with  documents  containing  just 
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1  Introduction 

Topic  models  use  latent  variables  to  explain  the  observed  (co-)occurrences  of  words 
in  documents.  They  posit  that  each  document  is  associated  with  a  (possibly  sparse) 
mixture  of  active  topics,  and  that  each  word  in  the  document  is  accounted  for  (in  fact, 
generated)  by  one  of  these  active  topics.  In  latent  Dirichlet  allocation  (LDA)  [12],  a 
Dirichlet  prior  gives  the  distribution  of  active  topics  in  documents.  LDA  and  related 
models  possess  a  rich  representational  power  because  they  allow  for  documents  to  be 
comprised  of  words  from  several  topics,  rather  than  just  a  single  topic.  This  increased 
representational  power  comes  at  the  cost  of  a  more  challenging  unsupervised  esti¬ 
mation  problem,  when  only  the  words  are  observed  and  the  corresponding  topics  are 
hidden. 

In  practice,  the  most  common  unsupervised  estimation  procedures  for  topic  mod¬ 
els  are  based  on  finding  maximum  likelihood  estimates,  through  either  local  search  or 
sampling  based  methods,  e.g.,  Expectation-Maximization  [43],  Gibbs  sampling  [22], 
and  variational  approaches  [12].  Another  body  of  tools  is  based  on  matrix  factoriza¬ 
tion  [26,36],  For  document  modeling,  a  typical  goal  is  to  form  a  sparse  decomposition 
of  a  term-document  matrix  (which  represents  the  word  counts  in  each  document)  into 
two  parts:  one  which  specifies  the  active  topics  in  each  document  and  the  other  which 
specifies  the  distributions  of  words  under  each  topic. 

This  work  provides  an  alternative  approach  to  parameter  recovery  based  on  the 
method  of  moments  [42],  which  attempts  to  match  the  observed  moments  with  those 
posited  by  the  model.  Our  approach  does  this  efficiently  through  a  particular  decompo¬ 
sition  of  the  low-order  observable  moments,  which  can  be  extracted  using  an  orthog¬ 
onal  tensor  decomposition.  This  method  is  simple  and  efficient  to  implement,  and  is 
guaranteed  to  recover  the  parameters  of  a  wide  class  of  topic  models,  including  the 
LDA  model.  We  exploit  exchangeability  of  the  observed  variables  and,  more  gener¬ 
ally,  the  availability  of  multiple  views  drawn  independently  from  the  same  hidden 
component. 


1 . 1  Summary  of  Contributions 

We  present  a  spectral  approach  to  topic  model  estimation  based  on  decomposing  the 
low-order  (cross)  moments  of  observed  variables.  The  approach  differs  from  other 
spectral  methods  (e.g.,  those  based  on  Principal  Component  Analysis  and  Canonical 
Correlation  Analysis)  in  that  it  is  based  on  an  orthogonal  tensor  decomposition  of  a 
k  x  k  x  k  third-order  moment  tensor,  where  k  is  the  number  of  latent  factors  (topics). 
In  many  applications,  k  is  typically  much  smaller  than  the  dimension  of  the  observed 
space  d  (number  of  words). 
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The  method  is  applicable  to  a  wide  class  of  latent  variable  models  including 
exchangeable  and  multi-view  models.  We  first  consider  the  class  of  exchangeable 
variables  with  independent  latent  factors.  We  show  that  the  low-order  moment  tensors 
possess  a  decomposition  known  as  a  canonical  polyadic  decomposition  [34]  (Lemma  1 
and  Lemma  2  in  Sect.  3.1).  We  then  consider  LDA  and  show  that,  even  though  it  does 
not  directly  possess  an  independent  latent  factor,  it  nearly  does  so  in  a  rigorous  sense, 
and  hence  a  simple  combination  of  lower-order  moments  has  the  required  decompo¬ 
sition  just  as  in  the  previous  case  (Lemma  3  in  Sect.  3.2).  Given  these  moments  from 
either  the  independent  latent  factors  model  or  LDA,  a  simple  and  computationally 
efficient  algebraic  procedure  [23]  recovers  the  model  parameters  exactly  under  a  mild 
rank  condition  (Theorem  2  in  Sect.  3.3).  Informally,  we  have  the  following. 

Theorem  1  (Informal)  Given  low-order  ( i.e.,  order  <  3  or  <  4)  moments  from  either 
the  independent  latent  factors  model  or  LDA  satisfying  a  rank  condition,  there  is  a 
polynomial  time  randomized  algorithm  that  returns  the  model  parameters  up  to  scaling 
and  permutation. 

We  describe  the  parameter  recovery  procedures  assuming  exact  moments  as  input, 
but  it  is  straightforward  to  write  down  analogous  “plug-in”  estimators  that  use  estimates 
of  these  moments  based  on  sampled  data.  We  do  just  this  using  a  particular  tensor 
decomposition  procedure  from  [5],  and  analyze  the  error  of  the  resulting  parameter 
estimates  in  terms  of  the  errors  in  the  moment  estimates  (Theorem  3). 

1.2  Related  Work 

Under  the  assumption  that  a  single  active  topic  occurs  in  each  document,  the  work  of 
[41  ]  provides  the  first  theoretical  guarantees  for  recovering  the  topic  distributions  (i.e., 
the  distribution  of  words  under  each  topic),  albeit  with  a  rather  stringent  separation 
condition  (where  the  words  in  each  topic  are  essentially  non-overlapping).  Under¬ 
standing  what  separation  conditions  permit  efficient  learning  is  a  natural  question;  in 
the  clustering  literature,  a  line  of  work  has  focussed  on  understanding  the  relationship 
between  the  separation  of  the  mixture  components  and  the  complexity  of  learning. 
For  clustering,  the  first  learnability  result  [19]  was  under  a  rather  strong  separation 
condition;  subsequent  results  relaxed  [1,10, 16, 17,20,33,45]  or  removed  these  condi¬ 
tions  [1 1,28,32,38];  roughly  speaking,  learning  under  a  weaker  separation  condition 
is  more  challenging,  both  computationally  and  statistically.  For  the  topic  modeling 
problem  in  which  only  a  single  active  topic  is  present  per  document,  [6]  provides  an 
algorithm  for  learning  topics  with  no  separation  requirement,  but  under  a  certain  full 
rank  assumption  on  the  topic  probability  matrix. 

For  the  case  of  LDA  (where  each  document  may  be  about  multiple  topics),  the  recent 
work  of  [8]  provides  the  first  theoretical  result  under  a  natural  separation  condition.  The 
condition  requires  that  each  topic  be  associated  with  “anchor  words”  that  only  occur 
in  documents  about  that  topic.  This  is  a  significantly  milder  assumption  than  the  one 
in  [41].  Under  this  assumption,  [8]  provides  the  first  rigorously  analyzed  algorithm  for 
learning  the  topic  distributions.  Their  work  also  justifies  the  use  of  non-negative  matrix 
(NMF)  as  a  provable  procedure  for  this  problem  (the  original  motivation  for  NMF  was 
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as  a  topic  modeling  technique,  though,  prior  to  this  work,  formal  guarantees  as  such 
were  rather  limited).  Furthermore,  [8]  provides  results  for  certain  correlated  topic 
models.  Our  approach  makes  further  progress  on  this  problem  by  relaxing  the  need 
for  this  separation  condition  and  providing  a  simpler  parameter  estimation  procedure. 

The  underlying  approach  we  take  is  a  certain  diagonalization  technique  of  the 
observed  moments.  We  know  of  at  least  three  different  settings  which  use  this  idea  for 
parameter  estimation. 

The  algebraic  tensor  decomposition  technique  of  [23]  (see  also  [35,37])*  was 
rediscovered  by  [15]  for  parameter  estimation  in  discrete  Markov  models.  The  idea 
has  been  extended  to  other  discrete  mixture  models  such  as  discrete  hidden  Markov 
models  (HMMs)  and  mixture  models  with  a  single  active  topic  in  each  document 
(see  [6,29,39]).  For  such  single  topic  models,  the  work  in  [6]  demonstrates  the  gener¬ 
ality  of  the  eigenvector  method  and  the  irrelevance  of  the  noise  model  for  the  obser¬ 
vations,  making  it  applicable  to  both  discrete  models  like  HMMs  as  well  as  certain 
Gaussian  mixture  models  (see  also  [28]). 

Another  set  of  related  techniques  is  the  body  of  algebraic  methods  used  for  the  prob¬ 
lem  of  blind  source  separation  [14].  These  approaches  are  tailored  for  independent 
source  separation  with  additive  noise  (usually  Gaussian)  [18].  Much  of  the  literature 
focuses  on  understanding  the  effects  of  measurement  noise,  which  often  requires  more 
sophisticated  algebraic  tools  (typically,  knowledge  of  noise  statistics  or  the  availability 
of  multiple  views  of  the  latent  factors  is  not  assumed).  These  ideas  are  also  used  by 
[21,40]  for  learning  a  linear  transformation  (in  a  noiseless  setting)  and  provides  a  dif¬ 
ferent  algorithm,  based  on  a  certain  ascent  algorithm  (rather  than  joint  diagonalization 
approach,  as  in  [14]),  and  a  provably  correct  algorithm  for  the  noisy  case  was  only 
recently  obtained  [9]. 

The  models  we  consider,  including  LDA,  are  distinguished  by  the  presence  of 
exchangeable  (or  multi-view)  variables  (e.g.,  multiple  words  in  a  document),  drawn 
independently  conditioned  on  the  same  hidden  state.  This  allows  us  to  exploit  ideas 
from  the  aforementioned  works  [14, 15,23].  In  particular,  we  show  that  the  topic  mod¬ 
eling  problem  exhibits  a  similarly  simple  algebraic  solution  as  in  previous  works. 
Furthermore,  the  exchangeability  assumption  permits  us  to  have  an  arbitrary  noise 
model  (rather  than  an  additive  Gaussian  noise,  which  is  not  appropriate  for  multino¬ 
mial  and  other  discrete  distributions).  A  key  technical  contribution  is  that  we  show 
how  the  basic  diagonalization  approach  can  be  adapted  for  Dirichlet  models,  through  a 
rather  careful  construction.  This  construction  bridges  the  gap  between  mixture  models 
and  independent  latent  factors  models. 

The  multi-view  approach  has  been  exploited  in  previous  works  for  semi-supervised 
learning  and  for  learning  mixtures  of  well-separated  distributions  (e.g.,  [7, 16, 17,31]). 
These  previous  works  essentially  use  variants  of  canonical  correlation  analysis  [27] 
between  the  two  views.  This  present  work  follows  [6]  in  showing  that  having  a  third 
view  of  the  data  permits  rather  simple  estimation  procedures  for  parameter  recovery 
without  separation  conditions. 


1  The  technique  of  [23]  is  actually  attributed  to  Robert  Jennrich. 
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Fig.  1  Directed  graphical 
model  representation  of  the 
multi-view  models 


A  preliminary  version  of  this  work  [3,4]  proposed  a  method  based  on  using  two 
singular  value  decompositions — essentially  a  symmetrized  version  of  the  algorithm 
from  [6].  However,  that  method  is  not  particularly  robust,  as  it  depends  on  randomiza¬ 
tion  to  create  separation  between  eigenvalues  of  a  particular  matrix.  In  this  article,  we 
propose  to  use  the  robust  orthogonal  tensor  decomposition  method  from  [5],  which  is 
more  robust  than  the  previous  method  and  still  computationally  efficient. 

2  The  Multi-view  Models 

Let  h  =  (h ] .  hj,  . . . ,  hf)  e  M*  be  a  random  vector  specifying  the  latent  factors 
(i.e.,  h  is  the  hidden  state)  in  a  model,  where  hi  is  the  value  of  the  ith  factor.  Let 
xj ,  Jt2,  ■  ■  ■  ■  Xf  e  WJ  he  random  vectors  which  we  take  to  be  observable.  Throughout, 
we  assume  t  >  3,  and  the  reason  will  become  clear  in  the  sequel.  The  primary 
modeling  assumption  is  that  the  observable  random  vectors  and  latent  factors  satisfy  the 
following  conditional  independence  and  linearity  condition;  we  assume  this  condition 
holds  the  remainder  of  this  paper. 

Condition  1  [Multi-view  model]  The  observations  x\,  xj.  . . . ,  xi  are  conditionally 

independent  given  h.  Moreover,  for  each  v  e  [£]  :=  [1,2,  i},  there  exists  a  matrix 

q(v)  e]gdxk  sucfl  that 


E[jc„|A]  =  0(v)h. 

The  conditional  independence  assumption  from  Condition  1  is  depicted  in  the  directed 
graphical  model  in  Fig.  1 .  This  model  generalizes  the  multi- view  model  from  [6]  in  that 
li  is  not  assumed  to  only  take  values  in  \e\ ,  ej.  ■  ■  ■ ,  c'k),  where  e,  e  [0,  1  }*  is  the  ith 
coordinate  basis  vector.  We  note  that  at  this  stage,  we  have  not  made  any  assumptions 
on  the  noise  model;  it  need  not  be  additive  nor  even  independent  of  h? 

We  also  assume  throughout  that  the  following  rank  condition  on  the  matrices  0(v> 
from  Condition  1. 

Condition  2  (Rank  condition)  Each  Olv>  has  full  column  rank. 

Condition  2  allows  for  the  identifiability  of  the  columns  of  ()<vl  and  was  used  in 
previous  works  on  parameter  estimation  [6,8,9, 15,21,28,39,40]. 

As  shown  in  [6]  (for  the  case  where  h  has  support  only  on  k  points),  the  multi-view 
model  captures  a  number  of  interesting  statistical  models,  including  certain  Gaussian 


-  By  additive  noise,  we  mean  a  model  in  which  x ,,  =  0>,  ] h  +  //,.  where  rjv  is  a  zero-mean  random  vector 
independent  of  h. 
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mixture  models  and  HMMs.  In  our  setting,  h  is  allowed  to  have  a  more  general 
distribution,  thus  enhancing  the  flexibility  of  the  model.  Our  goal  in  this  work  is  to 
given  estimators  of  the  matrices  0{v^  (and  possibly  parameters  related  to  the  latent 
factor  h),  solely  from  repeated  observations  (independent  copies  of  x  i .  xj,  ■  ■  • ,  xf). 

We  now  consider  two  specializations  of  this  multi-view  model  in  which  the  different 
views  x i ,  X2,  •  ■  ■ ,  X(  are  naturally  exchangeable.  For  these  cases,  the  conditional 
means  of  the  different  views  are  the  same.  That  is,  the  following  condition  holds. 

Condition  3  O ^  =  O  for  all  v  e  [£]. 

In  the  context  of  topic  models,  the  common  matrix  O  is  referred  to  as  the  topic  matrix, 
as  it  specifies  parameters  associated  with  each  topic  in  the  model.  Borrowing  this 
terminology,  we  generally  refer  to  all  as  topic  matrices. 

2. 1  Independent  Latent  Factors  Model 

In  the  independent  latent  factors  model,  we  assume  h  has  a  product  distribution, 

i.e.,  h  i ,  Ii2 . hk  are  independent.  In  the  case  where  the  xv  are  deterministic  linear 

functions  of  h,  (i.e.,  xv  —  Oh),  the  model  reduces  to  the  noiseless  ICA  model  of  [30], 
which  was  reinterpreted  as  a  parallelepiped  learning  problem  in  [21,40]. 

Two  important  examples  of  this  setting  with  noise  are  as  follows. 

Mixture  of  Gaussians  Suppose  xv  =  Oh  +  t],  where  //  is  zero-mean  Gaussian  noise 
and  h  has  a  product  distribution  over  [0,  1  }  .  Here,  the  z  th  column  o,  of  O  can  be 
considered  to  be  the  mean  of  the  i  th  Gaussian  component.  This  generalizes  the  classic 
mixture  of  k  Gaussians,  as  the  model  now  permits  any  number  of  Gaussians  to  be 
responsible  for  generating  an  observation  (i.e.,  h  is  permitted  to  be  any  of  the  2k 
vectors  on  the  hypercube,  while  in  the  classic  mixture  problem,  only  one  component 
is  responsible).  Alternatively,  the  model  can  be  viewed  as  a  classical  mixture  of  2k 
Gaussians,  where  the  mean  of  a  component  S  e  2^  is  X/eS  °>  ant*  its  mixing  weight 
is  Pr (hi  =  IV/  e  S )  [9].  Note  that  //  is  allowed  to  be  heteroskedastic  (i.e.,  the  noise 
may  depend  on  h),  so  the  Gaussians  need  not  have  the  same  covariance. 

Poisson  topic  model  This  is  a  model  proposed  by  Canny  [13].  Suppose  [Oh\j  specifies 
the  Poisson  rate  of  counts  for  [x„] : .  For  example,  jc„  could  be  a  vector  of  word  counts 
in  the  i/th  sentence  of  a  document.  Here,  O  would  be  a  matrix  with  positive  entries, 
and  hi  would  scale  the  rate  at  which  topic  i  generates  words  in  a  sentence  (as  specified 
by  the  /'th  column  of  O).  The  linearity  assumption  in  Condition  1  is  satisfied  as 
E[jr„|/z]  =  Oh  (note  the  noise  is  not  additive  in  this  case,  in  contrast  to  the  mixture 
of  Gaussians  example).  Here,  multiple  topics  may  be  responsible  for  generating  the 
words  in  each  sentence.  This  model  similar  to  LDA,  except  for  the  fact  that  here,  h 
has  a  product  distribution  (whereas  in  LDA,  h  is  a  probability  vector). 

2.2  The  Dirichlet  Model 

Now  suppose  the  hidden  state  h  is  a  probability  distribution  itself,  with  a  density 
specified  by  the  Dirichlet  distribution  with  parameter  vector  a  =  (ct\,  a.2,  ■  ■  ■ ,  ctk)  e 
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Ra;0  (a  is  a  strictly  positive  real  vector);  this  assumption  is  written  as  h  ~  Dir(a).  We 
think  of  li  e  A*-1  as  a  distribution  over  topics;  here,  A*  1  denotes  the  probability 
simplex  of  distributions  over  k  outcomes.  The  density  of  h,  the  Dirichlet  density,  is 


the  “pseudo-counts”)  characterizes  the  concentration  of  the  distribution.  As  «o  —>  0, 
the  distribution  degenerates  to  one  over  pure  topics  (i.e.,  the  limiting  density  is  one  in 
which,  almost  surely,  exactly  one  coordinate  of  h  is  1,  and  the  rest  are  0). 

If  the  xv  are  deterministic  linear  functions  of  h  (i.e.,  jc„  =  Oh),  then  the  model  can 
be  viewed  as  the  problem  of  learning  a  certain  class  of  distributions  over  a  simplex 
with  vertices  o\ ,  02,  ■ . . ,  0*>  The  special  case  of  a  uniform  distribution  over  a  simplex 
(a  problem  suggested  in  [21])  is  obtained  when  a  =  (1,  1, . . . ,  1). 

Latent  Dirichlet  Allocation  LDA  makes  the  further  assumption  that  each  random  vec¬ 
tor  x  1 ,  X2,  ■  ■  ■ ,  X(  takes  on  discrete  values  out  of  d  outcomes  (e.g.,  xv  represents  what 
the  nth  word  in  a  document  is,  so  d  represents  the  number  of  words  in  a  vocabulary). 
The  / th  column  o,  of  O  is  a  probability  vector  representing  the  distribution  over  words 
for  the  /th  topic.  The  sampling  process  for  a  document  is  as  follows.  First,  the  topic 
mixture  h  is  drawn  from  the  Dirichlet  distribution.  Then,  the  nth  word  in  the  docu¬ 
ment  (for  v  e  [£]  :=  { 1 , 2,  ...,£})  is  generated  by:  (i)  drawing  t  e  [k]  according  to 
the  discrete  distribution  specified  by  h,  then  (ii)  drawing  xv  according  to  the  discrete 
distribution  specified  by  ot.  Note  that  xv  is  independent  of  h  given  t.  For  this  model 
to  fit  in  our  setting,  we  use  the  “one-hot”  encoding  for  xv  (also  used  by  [6]  in  this 
context):  xv  e  [0,  l}rf  with 

[jt„] j  =  1  the  nth  word  in  the  document  is  the  /th  vocabulary  word. 

Observe  that 


k  k 


E[jt„|/i]  =  ^^Pr[/  =  i\h]  ■  =  Z,  h]  =  ^  hj  ■  Oj  =  Oh 


as  required  in  Condition  1 .  Again,  note  that  the  noise  model  is  not  additive. 

3  Moment  Structure  in  Multi-view  Models 

In  this  section,  we  describe  the  structure  of  moments  for  the  multi-view  models,  focus¬ 
ing  primarily  on  the  exchangeable  models  (the  independent  latent  factors  model  and 
the  Dirichlet  model).  Recall  that  for  these  exchangeable  models,  we  assume  0^  =  0 
for  all  v  e  [(]  (Condition  3).  In  all  cases,  the  pth-order  (not  necessarily  raw)  moments 
are  of  the  form 
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k 


;=1 


p  times 


where  c\p,  C2,P,  ■  ■  ■ ,  Ck.P  e  R.  are  scalars  possibly  depending  on  p.  This  is  known  as  a 
symmetric  canonical  polyadic  decomposition  [24,25,34]  (or  a  Tucker  decomposition 
with  a  diagonal  core  tensor  [44]).  We  show  that  if  the  cpp  are  non-zero,  then  the  o,  (up 
to  some  scaling  and  permutation)  can  be  extracted  from  these  moments  using  simple 
algebraic  techniques  such  as  that  from  [23],  We  also  give  a  reduction  from  the  general 
multi-view  setting  to  the  exchangeable  setting. 

3.1  Moments  of  Skewed  and  Kurt  otic  Independent  Factors 

Let  nii  p  :=  E[(/i;  —  E[/i,])p]  denote  the  pth  central  moment  of  h\.  The  variance, 
skewness,  and  kurtosis  of  hj  are  given  by  <r?  :=  m(j2,  Yi  '■=  ,  and  at,-  := 

miA/(7t  ~  3,  respectively. 

Define  the  following  moments  of  the  observable  random  vectors: 


p  :=  E[jci]  e 

Pairs  :=  E[(jti  -  p)  ®  (JC2  -  /<■)]  e  Rdx'/, 

Triples  :=  E[(jti  —  p)  <g>  (JC2  —  p)  ®  (X3  -  p)]  e  Mdx£/xd 


(1) 


Here,  we  use  <g>  to  denote  the  tensor  product.  For  instance,  given  vectors  u,v,weS!/J, 
the  tensor  product  u  (g>  v  <g>  w  e  M.dxdxd  is  the  third-order  tensor  whose  ( i ,  j,  /:)th 
entry  is  UiVjWk . 

Lemma  1  (Independent  latent  factors  moments)  Assume  Conditions  1  and  3  and  that 
h  has  a  product  distribution.  Then 


p  =  ^  ,  Pairs  =  ^  ni/,20;  <S>  o , , 


k 


k 


k 

Triples  =  ^  /n/,30/  ®  0,  ®  o, . 


;  =  1 


Proof  Since  E[x,,  |//  |  =  Oh  (by  Conditions  1  and  3)  we  have 


k 


p  =  OE[h]  =  ^  nij  1 0, . 


This  also  implies 


k 


E[(jc„  -  p)\h]  =  0(h  —  E[/?])  =  y^(hj  —  E[/j,])o, . 
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Since  x  i  —  fi  and  xi  —  are  conditionally  independent  given  h  (under  Condition  1), 
we  may  write  Pairs  as  the  expectation  of  a  tensor  product: 


Pairs  =  E 


E [hi])Oi 


(hi  -  E(hi])Oi 


Expanding  out  the  tensor  product  and  applying  linearity  leaves  only  the  diagonal  terms 
with  E [(hi  —  E [hi])2].  This  is  because  lr  has  a  product  distribution  by  assumption, 
and  thus  E [(hi  —  E[/i;])(/r  /  —  E [/; y])]  =  0  for  i  ^  j.  With  only  the  diagonal  terms  in 
the  product  remaining,  we  have 

k  k 

Pairs  =  ^E [(hi  —  E[/?;])2]o;  <g>  o,  =  ^  m;,2Q;  0  0;. 

1  =  1  1  =  1 


An  analogous  argument  gives  the  claim  for  Triples.  □ 

Lemma  1  shows  that  fi,  Pairs,  and  Triples  possess  the  structure  needed  to  apply  the 
tensor  decomposition  technique  of  [5]  to  extract  the  o, .  However,  this  is  only  possible  if 
the  scalar  factors  are  non-zero;  in  Lemma  1,  the  scalar  factors  are  the  central  moments 
mj  p.  If  h  has  a  distribution  symmetric  about  its  mean,  the  third  central  moment  is 
zero,  and  hence  Triples  cannot  be  used.  One  recourse  comes  from  the  literature  on 
independent  component  analysis;  if  the  distribution  of  h  is  kurtotic  (i.e.,  /c;  ^  0),  then 
one  may  use  fourth-order  moments  in  the  form  of  the  fourth  cumulant  tensor  [14].  In 
the  next  lemma,  we  show  that  this  can  also  be  applied  with  multi-view/exchangeable 
models. 

Define 


Quadruples  :=  E[(xi  —  ft)  0>  (X2  —  fi)  0  (*3  —  ft)  0  (JC4  —  /t)]  —  T  (2) 

where  T  e  JHdxdxdxd  js  the  fourth-order  tensor  whose  (i,  j,  m,  «)th  entry  is 

Tij.m.n  =  [Pairs],-,  7-[Pairs]m,„  +  [Pairs];,  m  [Pairs]  +  [Pairs];,,,  [Pairs]  y,m. 

Lemma  2  (Independent  latent  factors  moments,  fourth-order)  Under  the  same  setting 
as  Lemma  1, 


k 

Quadruples  =  ^(w;, 4  —  3 mj2)Oi  0  o,  0)  o;  0  o, . 

i=  1 

Proof  By  the  same  argument  as  in  Lemma  1,  we  have  that 

E[(JC1  —  //,)  0  (Xi  —  fi)  0  (*3  —  fl)  0  (X4  —  p,)]  =  E[t!  0  «  0  V  0  »] 
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where 


k 

v  :=  Till,  -  E [hi])Oj. 

1=1 

Expanding  the  tensor  product  and  applying  linearity  of  expectation  leaves  only  terms 
involving  E[(/i;  — E[/i;])4]  and  E[(/z;  —  E  [hi])2(hj  —  E[/i  ;])2],  again  due  to  the  product 
distribution  of  li.  Thus 

E[d  <g>  v  ®  v  ®  u] 

k 

=  ®  Oi  ®  Oj  ®  Oi 

i=i 

+  ^  ( O,  <g>  Oi  ®  Oj  <g>  Oj  +  Oi  <g>  Oj  <g>  Oi  ®  Oj  +  Oi  <g>  0/  <g>  Oj  ®  O,-) 

i^j 

k 

=  —  3mj2)Oi  <g>  Oi  <g>  o,  ®  o, 

1=1 

+  7W/,2W  j,2  ( Oi  ®  0/  ®  Oy  ®  Oj  +  0/  ®  Oj  ®  O;  ®  Oj  +  Oj  ®  0/  0  Oj  ®  O/) 

‘.j 

k 

=  ^(/«/,4  —  mj2)°i  ®  0/  ®  0/  ®  Oj  +  T 
1=1 

where  the  step  uses  the  identity  for  Pairs  from  Lemma  1 .  Rearranging  the  terms  gives 
the  desired  identity.  □ 

We  note  that  if  h  is  an  isotropic  Gaussian  random  vector,  then  both  m;, 3  and  711,4  — 
3m?2  are  zero  for  all  i  e  |T],  causing  both  Triples  and  Quadruples  to  vanish.  As 
expected,  these  higher-order  moments  cannot  help  with  the  identification  of  O  without 
further  assumptions:  this  is  simply  because  g  :=  UT h  has  the  same  distribution  as  li 
when  U  is  orthogonal,  and  E[jCu|/j]  =  Oh  =  (OU)g. 

3.2  Moments  of  Dirichlet  Factors 

The  Dirichlet  distribution  is  not  a  product  distribution,  and  therefore  Lemma  1  does  not 
apply  to  models  where  h  ~  Dir  (a ) .  However,  it  is  nearly  a  product  distribution.  Indeed, 
if/i  ~  Dirfa),  then  h  has  the  same  distribution  x /  X/Ti  x/ ,  where  4/  ~  Gamma(a;,  1) 
(i.e.,  Xi  follows  a  Gamma  distribution  with  shape  parameter  a,  and  scale  parameter 
1),  and  xi,X2 , ,Xk  are  independent.  Thus,  the  essential  reason  why  the  Dirichlet 
distribution  is  not  a  product  distribution  is  because  it  is  restricted  to  the  probability 
simplex. 

Lemma  3  (below)  nevertheless  shows  that  if  the  concentration  parameter  ao  is 
known,  the  correlations  introduced  by  restricting  to  the  probability  simplex  can  be 
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accounted  for  by  slightly  tweaking  the  moments;  this  tweaking  causes  the  same  kind 
of  structure  as  in  Lemma  1  to  manifest. 

Define  the  following  moments  of  the  observable  random  vectors: 


P 

Pairs„0 

Triples^ 


=  E[jc  i  ]  e  Mr, 

=  E[*i  0  *2]  - 

a0  +  1 

=  E[xi  0  X2  0  JC3] 

ao 


0,0  /t0(ie  Mdxd , 


a0 


^E[X[  0  JC2  0  fl]  +  E[JC1  0  fl  0  JC3]  +  E [ft  0  X2  0  JC3]^ 


2cka 


(«o  +  2)(a0  +  1) 


/10/I0(l£ 


\>dxdxd 


(3) 


Lemma  3  (Dirichlet  factors  moments)  Assume  Conditions  1  and  3  and  that  h  ~ 
Dir(a).  Then 


i=i 


T  =  —Qj,  Pairsff0  =  ^ 

i=i 
k 

Triples^  =  ^ 


(a0  +  l)ao 
2  a. 


Oj  0  Oj , 


1=1 


(a0  +  2)(a0  +  l)«o 


-  Oj  0  Oi  0  Oj. 


Proof  We  directly  calculate  univariate,  bivariate,  and  trivariate  moments  of  h :  for  any 
distinct  i,  j,  l  e  [k], 


ao  (ao  +  l)«o 

CtiCLi  o 

E [hihj]  =  - ,  E [hfhj]  = 


El*?]  =  +  + 

(ao  +  2)  (ar0  +  l)a0 

(a;  +  1  )oti<Xj 


E[hihjhi ]  = 


(ao  +  1)«0 
a/ a  jCti 


(ao  +  2)(a0  +  l)a0  ’ 


(a0  +  2)(a0  +  l)a0 
Putting  these  in  vector,  matrix,  and  third-order  tensor  form, 


1  k 

E[A]  =  —  ^  a,  et 
ao  .. 

1  =  1 

E [h  0  h]  = - a,ej  0  e,  +  a  0  a  ) 

(ao  +  l)ao  \“j'  / 

/  k 

( 2^a,(e, 

v  i=i 


E[A  0  h  0  h]  = 


(a0  +  2)(a0  +  l)a0 


|  e i  0  e, 
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k  k 

+  ^  at  (e,  <g>  e,-  <g>  a)  +  ^  a,  (a  <g>  e,  <g>  e,  ) 

;=i  i=i 

k 

+  ^  a;  (e,-  ®  a  <g)  ef)  +  a  ®  a  <g>  a 

i=i 

Since  E[jct, \h]  =  Oh  by  Conditions  1  and  3  (as  in  the  proof  of  Lemma  1),  and 
because  the  xv  are  conditionally  independent  given  h,  the  claim  follows  by  linearity 
and  rearranging.  □ 

3.3  Identifiability  from  Low-order  Moments 

We  now  provide  a  simple  argument  that  establishes  the  identifiability  of  the  columns 
of  the  topic  matrix  O  (up  to  scaling  and  permutation)  from  the  low-order  moments 
considered  in  Lemmas  1,  2,  and  3.  Here,  Condition  2  plays  an  essential  role,  as  does 
the  non-Gaussianity  of  h  in  the  case  of  the  independent  latent  factors  model. 

Theorem  2  (Identifiability  from  low-order  moments)  Assume  Conditions  1,  2,  and 
3  hold. 

1.  If  h  has  a  product  distribution  and  each  h,  has  positive  variance  of  and  non¬ 
zero  skew  yi,  then  there  is  a  randomized  algorithm  that,  given  Pairs  and  Triples 
from  (1),  returns  the  set  of  vectors  returns  { (s;  ay  Oj ,  si  Yi )  :  i  G  [k]}  for  some 

{Si,,S2,  ...,5*}  c  {±1}. 

2.  Ifh  has  a  product  distribution  and  each  h ,■  has  positive  variance  of  and  non-zero 
kurtosis  Ki,  then  there  is  a  randomized  algorithm  that,  given  Pairs  and  Quadruples 
from  (1)  and  (2),  returns  {{siOjOj ,  SjKj)  :  i  G  [k])  for  some  {s\,  sz,  ■  ■  ■ , .%}  C  {±1}. 

3.  If  h  ~  Dir(a)  for  some  a  e  Kv  then  there  is  a  randomized  algorithm  that, 

given  Pairsao  and  Triples^  from.  (3),  returns  {(sj^/cfjoi ,  SiCi^/c/f)  :  i  e  [£]} 
for  some  . . . ,  .v*. }  C  {±1},  where  C{j 2  =  di/((oio  +  l)«o)  an<I  c/  3  = 

2q!,/((q'o  +  2)(a0  +  l)ao)- 

Proof  The  theorem  follows  from  the  structural  results  from  Lemma  1,  Lemma  2,  and 
Lemma  3,  combined  with  Lemma  4  (below).  □ 

Lemma  4  is  a  variant  of  a  result  from  [23],  which  establishes  the  uniqueness  of 
(symmetric)  canonical  polyadic  decompositions  under  a  rank  condition.  Our  variant 
uses  a  symmetrized  version  of  procedure  from  [23],  and  is  proved  using  the  following 
observations.  First,  if  the  second-order  moment  matrix  (called  P  in  Lemma  4)  has 
rank  k,  then  it  defines  an  inner  product  system  under  which  the  0,  are  orthogonal. 
Moreover,  under  this  inner  product,  the  0,  are  orthogonal  eigenvectors  of  a  generic 
flattening  of  higher-order  moment  tensors  (T  or  Q  in  Lemma  4). 

Below,  for  a  vector  v  =  (v\,  V2,  ■  ■  ■ ,  Vd)  e  Rd,  let  T [u]  e  Rdxd  denote  the  matrix 
whose  (i,  y)th  entry  is  X/Li  vi[T]jjj. 

Lemma  4  Let  O  :=  [oi|o2|  •  •  •  | ok]  e  Rdxk,  P  :=  £?=i  q,2o/  (8)  o,  e  M£/xd,  T  := 
S/=t  ci,3°i®°i  ®°i  e  Rdxdxd,  and  Q  :=  X/=i  ci,4°i  ®  Oi  <8  o,-  0  o,  e  ]^dxdxdxd. 
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Algorithm  1  Identification  from  exact  moments 

input  positive  integer  A:  e  N,  P  e  Rr/xi/  and  T  e  |R.4x4x4  sa[j5{\  jnn  conditions  in  Lemma  4. 
output  {(u,-,  A;)  :  i  e  [A]). 

1 :  Let  2;=i  Vi  Ki®Ki  he  an  eigendecomposition  of  P,  where  { :  i  £  [A:])  are  orthonormal  eigenvectors, 
and  {tjj  :  i  e  [A:]}  are  corresponding  positive  eigenvalues. 

2:  Set  W  :=  [?1|?2|  •  •  •  l^-ldiagd /^n,  1 /Jrji,  1 /,/%). 

3:  Draw  9  uniformly  at  random  from  the  unit  sphere  in  R*. 

4:  Let  <5/ § /  (g>  f  be  an  eigendecomposition  of  W~^  T  {W  8}W ,  where  {0,  :  i  e  [A:]}  are  orthonormal 

eigenvectors,  and  {<5;  :  i  e  [A])  are  corresponding  non-zero  eigenvalues. 

5:  Let  u,  :=  (WT)t|,  and  A;  :=  {W^)TT{Wf}Wf  for  each;  e  [A], 


Assume  that  O  has  full  column  rank  and  that  c;, 2  >  0  for  all  i  e  [A].  If  c;,3  ^  0  for 
all  i  G  [A],  then  there  is  a  randomized  algorithm  that,  given  P  and  T  as  input,  returns 
{(SiCj  2  Oi,siCii3/cf2  )  :  i  e  [k]}  for  some  {si,S2,  c  {±1}.  If  CiA  ^  0  for  all 

i  e  [A],  then  the  same  holds  with  Q  in  place  ofT  and  CiA/cf2  in  place  ofc(A/ci  2  . 

Proof  Under  the  assumptions  on  O  and  the  c,-: 2,  the  matrix  P  is  positive  definite.  By 
rescaling  the  columns  of  O  (replacing  0,  by  Jcffoi )),  we  may  assume  that  c;, 2  =  1 
for  all/  e  [A].  Therefore  there  exists  a  matrix  VI’  e  M.dxk,  which  can  be  obtained  from 
the  singular  value  decomposition  of  P,  such  that  VI'  P  IT  =  ( IV’  ())( VI' j  0)T  =  Ik. 
This  in  turn  implies  that  M  VV' !  O  is  orthogonal.  Then,  for  any  vector  0  e  , 

WtT(W0)W  =  iyTOdiag(diag(ci,3,  c2,3,  . . . ,  ck,3)0T  W0)OT  W 
=  Mdiag(diag(ci,3,  C2,3,  •  •  ■ .  ckA)MT0)MT . 

Since  M  is  orthogonal,  the  above  displayed  equation  gives  an  eigendecomposition  of 
WtT[W0}W,  and  the  set  of  eigenvectors  corresponding  to  non-repeated  eigenvalues 
are  uniquely  defined  up  to  sign.  Each  such  unit-norm  eigenvector  §,  (after  appropriate 
reordering)  is  of  the  form  SjMe ,  =  s,  IV'  0,  for  some  Sj  e  {±1}.  Therefore  u,  := 
(Vyt)T^.  =  .v,  VT( IT  IT)~*  1 *  Wo,  =  SjOi,  since  range(IT)  =  range) O).  Finally, 

A 

Xi  :=  (W^)tT{W^}W^  =  3(sio]WWToi)3 * * * 7 

7  =  1 

k 

=  Si^CjMe]  0TWWTOei)3  =  siCi,  3. 

7=1 

Assume  that  c(- 3  f  0  for  all  /  e  [A].  If  0  is  drawn  uniformly  at  ran¬ 
dom  from  Sk~l,  then  so  is  M  0 .  In  this  case,  almost  surely,  the  entries  of 
diag(diag(ci:3,  C23, . . . ,  ckA)MT0)  are  unique.  Hence,  no  eigenvalue  of 
WJT{W0}W  is  repeated,  so  the  set  {(T; ,  a,)  :  i  e  [A]}  has  the  property  claimed 
by  the  lemma  and  is  returned  by  Algorithm  1 . 

An  analogous  argument  proves  the  claim  assuming  c;, 4  f  0  for  all  i  e  [A]  using 
Q  in  place  of  T.  □ 
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The  critical  aspect  in  Theorem  2  is  that  only  low-order  observable  moments  are 
needed  to  identify  the  columns  of  O.  The  observability  implies  that  they  can  be 
estimated  from  data,  and  used  in  a  plug-in  estimator.  The  low-order  nature  of  the 
moments  mean  that  they  can  be  estimated  reliably  (relative  to  higher-order  moments). 

From  Theorem  2,  we  may  also  conclude  that  since  ft ,  Pairs,,,, ,  and  Triples,*  involve 
only  xi,  x 2,  and  xj,  a  random  document  under  the  LDA  model  need  only  have  three 
words  (conditionally  independent  given  h ).  This  is  the  same  conclusion  as  in  the 
mixture  of  multinomial  model  for  documents  studied  in  [6],  even  though  in  that  model, 
the  words  in  documents  were  assumed  to  only  be  generated  from  a  single  topic. 

It  is  worth  noting  that  Pairs,,,,  and  Triples^  depend  on  the  value  ao  =  a,-, 

so  it  must  be  known  in  order  to  form  the  appropriate  moment  estimates.  In  some 
applications,  one  may  have  prior  knowledge  of  ao,  as  it  characterizes  the  concentration 
of  the  Dirichlet  distribution  (and,  indeed,  having  prior  knowledge  of  ao  is  indeed 
much  weaker  than  knowing  the  entire  parameter  vector  a).  The  particular  case  where 
a  —  (1,  1,...,  1 )  is  essentially  the  problem  of  learning  the  vertices  of  a  simplex  when 
given  access  to  (noisy)  samples  drawn  from  the  uniform  distribution  over  this  simplex; 
therefore  our  result  resolves  this  open  problem  posed  by  [21]. 

In  the  case  of  the  independent  latent  factors  model,  the  third-  and  fourth-order 
moments  are  exploited  to  take  advantage  of  the  non-Gaussianity  of  h  (which  is  not 
possible  with  only  first-  and  second-order  moments,  without  further  assumptions). 

3.4  Reducing  the  General  Multi-view  Setting  to  the  Exchangeable  Setting 

To  conclude  this  section,  we  show  how  to  reduce  the  general  multi-view  setting  (where 
the  different  ()<vl  are  not  necessarily  the  same)  to  the  case  where  the  0(v>  =  O  for  all 
v  e  [£].  Here,  we  must  assume  that  the  number  of  views  is  at  least  three  (i.e.,£  >  3, just 
as  in  the  independent  latent  factors  and  Dirichlet  models,  since  Triples,  Quadruples, 
and  Triples^  all  depend  on  at  least  x\,X2,  and  X3).  This  reduction  is  based  on  a  proof 
technique  used  in  [2] . 

Define  Pairs,,  „  :=  E[x„  ®  x„]  for  u,  v  e  [£].  Fix  any  Vo  e  [£],  which  we  call  the 
target  view.  Then,  for  each  v  e  [€]\{vo},  pick  any  u  =  n( v)  e  [£]\{u,  uo}  (which  is 
possible  since  l  >  3)  and  define 

Cv^vo  :=  Pairs,  ,  ,, Pairs’^,  C„0^„  :=  Pairs,  „Pairsi)  U. 

Proposition  1  (View  symmetrization)  Assume  Conditions  1  and  2  hold,  and  also 
that  E[/i  <g)  h]  is  invertible.  Fix  a  target  view  vq  e  [l],  and  let  O  :=  O  (vo).  Let 
Cvo-s,vo  :=  Ici,  and  for  each  view  v  e  [£],  let  xv  :=  Cv^v()x„  where  CV^,VQ  is 
defined  above  for  v  ^  Vo-  For  each  v  e  [i], 

E[xv\h]  =  Oh,  0{v)  =  Cvo^vO. 

Proof  Assume  without  loss  of  generality  that  vo  :=  1,  v  :=  2,  and  u  —  u ( v )  :=  3 
(in  the  definition  of  C2->i).  By  Condition  1,  Pairs  13  =  0{l)E[h  ®  h] 0<3)T 
and  Pairs!  ,  =  0(2)E[/i  <g>  h  |0,3)T .  By  Condition  2  and  the  assumption  that 


5)  Springer 


Author's  personal  copy 


Algorithmica  (2015)  72:193-214  207 

E[A  <S>  h]  is  invertible,  both  Pairs  1.3  and  PairS2,3  have  rank  k.  Moreover,  Pairs]  3  = 
(0(3)t)+E[/i  <g>  h]~lO(2)\  so 

C2^i  =  (tf(1)E[/i®  A]0(3)T^(0(3)T)fE[/i  ®/i]-10(2)t)  =  O(1)0(2)t. 

Since  Oi2)  has  full  column  rank  (by  Condition  2), 

E[x2\h]  =  C2^\E[x2\h]  =  0(l)0(2n0(1)h  =  Omh  =  Oh. 

By  the  same  argument,  Ci_*.2  =  0{2)  ;  =  Oi2)  O ' :  therefore  C\^20  = 

0{2)  O  b  =  0(2^  since  O  has  full  column  rank.  □ 


4  Estimation  from  Moments  via  Orthogonal  Tensor  Decomposition 

In  addition  to  establishing  identifiability.  Theorem  2  provides  an  efficient  randomized 
algorithm  for  recovering  the  columns  of  O  up  to  permutation  and  scaling,  and  it  indeed 
can  be  shown  to  work  using  only  estimates  of  the  required  moments  using  techniques 
similar  to  [6,39].  However,  the  resulting  sample  complexity  is  rather  high  (i.e.,  a  high- 
degree  polynomial)  on  account  of  the  randomization  technique  used  to  ensure  the 
uniqueness  of  an  eigendecomposition.  Indeed,  the  randomization  collapses  moment 
tensors  into  matrices,  but  the  resulting  spacing  between  eigenvalues  is  small;  hence, 
such  procedures  may  not  be  very  robust  to  errors  in  the  moment  estimates.  It  turns 
out  this  is  only  an  artifact  of  the  algorithmic  technique,  as  a  different  technique  based 
on  orthogonal  tensor  decomposition  has  both  a  better  analysis  and  better  empirical 
support.  In  this  section,  we  recall  the  tensor  decomposition  technique  from  [5]  and 
show  how  it  can  be  applied  to  the  estimation  problem  for  the  multi-view  models 
considered  in  this  work. 


4. 1  Multi-way  Arrays,  Multi-linear  Forms,  and  Tensor  Algebras 

Recall  that  low-order  moments  from  Sect.  3  have  the  form  X/=i  ci.p°f P  where 
v ®p  v  ®  v  0  -  -  -  <g>  v  denotes  the  /3-fold  tensor  product  of  a  vector  v  with  itself. 
We  assume  that 


P  :=^Ci'2of2  and  T  :=^citPofp 
1- 1  (=1 

for  some  p  >  3  are  given,  with  c,-, 2  >  0  and  Ci  p  >  0  for  all  i  e  [£],  and 
{01,02,  ...  ,Ok}  C  R'7  linearly  independent.  We  note  that  it  is  also  possible  to  handle 
the  case  where  c/, 2  <  0  and  Ci  p  <  0  for  some  i  e  [k]  (using  techniques  from  [46]), 
though  we  avoid  here  this  case  for  simplicity.  In  our  examples,  we  will  specialize  to 
the  p  —  3  case. 
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From  the  second-order  moment  matrix  P ,  as  discussed  in  the  proof  of  Lemma  4,  we 
may  extract  a  so-called  whitening  matrix  W  e  M.dxk  using  the  singular  value  decom¬ 
position  of  P  such  that  WT  PW  =  Ik-  Put  another  way,  P  defines  a  ^-dimensional 
inner  product  subspace  of  in  which  B  :=  <Jc2,202,  ■  ■  ■ ,  Jck,20k\  are 

orthonormal.  To  see  this,  we  define  the  inner  product  by 

{«,  v)  :=  nT 


Since  the  o,-  are  linearly  independent  and  q, 2  >  0, 

P  =  (0T)tdiag(cii2,  c2,2,  •  •  • , 
where  O  '  O  —  Ik-  This  implies  that 


(- s/Ch20i , 


0  if  1967 
1  if  i  =  j 


as  claimed. 

The  pth-order  tensor  T  can  be  viewed  in  a  number  of  different  ways.  The  first  is 
as  a  p- way  array  of  real  numbers  7}  1,i2,...,ip  for  i\,  h, . . . ,  ip  e  [d].  The  second  is  as 
a  p-linear  form  T  :  x  x  •  •  •  x  W1  — >  I  (generalizing  the  bilinear  form  view 

of  a  matrix):  for  any  u  =  (u  1,  M2, . . . ,  ud)  €  v  =  (ui,  V2,  ■  ■  ■ ,  Vd )  e  Rd,  and 
w  —  (w\ ,  W2,  •  •  • ,  vod)  e 


T(m,  w,  w)  =  y  TjjjUjVjWi, 
i,jJ 


which  connects  the  p- way  array  with  the  p-linear  form.  We  will  generally  denote, 
fort/  =  [H1|M2|---|Md]  e  R"!lX£/,y  =  [di|d2|---M  e  Rm2Xd,  and  W  = 
[ttfi|u>2|  ■  ■  ■  \wd\  e  K'"3X</,  the  tensor  T (t/,  V,  W)  e  Mm| xm2Xm3  gjven  py 

T(l/,  V,  W)  =  y  TijjUj  ®  iij  <2>  up. 
i.j.l 

(Note  that  the  notation  “rfi;}”  from  Lemma  4  can  be  written  as  T (Id,  Id,  u).) 

The  third  view  of  T  is  as  a  member  of  (Wl  )(x>p ,  the  pth-order  tensor  product  of 
In  this  view,  we  may  pick  a  basis  for  this  vector  space,  and  represent  T  in  that 
basis.  A  natural  choice  is  derived  from  the  standard  coordinate  basis  for  IR^,  giving 
{e,  ®  e  j  ®  ei  :  i,  j,  l  e  [c/]}.  In  this  basis,  we  identify  the  p- way  array  as  a  tensor 
algebra  element: 


T  =  y  Ttjjet  <g)  ej  <8>  ei. 
i,j,I 


Alternatively,  if  B '  \ yc\  20 1 ,  ^Jt'2.202 ,  ■  ■  ■  *Jcd,2°d }  is  a  completion  of  B  from 

above,  we  may  derive  a  different  basis  {u  ®  v  ®  w  :  u,  v,  w  e  B’}  for  (Rd)®P,  under 
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which  T  has  the  following  diagonal  representation: 


k 


Note  that  T  as  a  p-linear  form  can  be  expressed  in  terms  of  the  standard  inner  product 

uTv  =  X/=i  uivi  giving 

T(u ,  D,  w)  =  y  TjjjjeJ u)(e] v)(ej w ). 
i,jd 


The  proposed  orthogonal  tensor  decomposition  algorithm  from  [5]  exploits  the  p- 
linear  form  based  on  the  inner  product  (•,  •},  as  it  exploits  the  orthogonality  of  the 
s/ciyor. 

T(P] u,  P  V,  P  ’w)  =  Tijiia,  u)(ej,  V){ei,  w) 
k 

=  T,  -^niVcj^Oi,  vHJcijOi,  W) 

i= 1  Ci,2 

Indeed,  it  is  shown  that  the  map 


T{Id ,  PV  Ptu) 

PV  P'd),  P(/rf,  Pit,  P1  v)) 


(4) 


has  non- zero  stable  fixed  points  only  at  ^/cipOj  for  i  e  [A:].  Moreover,  repeated 
application  of  the  above  map  starting  from  a  random  v  e  range  (P)  D  {u  e  R';  : 
||  u  || 2  =  1}  converges  at  a  quadratic  rate  to  one  of  the  yci  jo,  (Lemma  5.1  from  [5]). 
Finally,  it  is  easy  to  check  that  T(P  ycijOj ,  P '  y/cipOi ,  P '  Jc'ipo, )  =  aj /c(3 ^  ■ 


4.2  Efficient  Estimation  Algorithm 


In  an  actual  implementation,  we  must  estimate  the  moment  matrices  and  tensors  P  and 
T  from  data.  Let  S  :=  {(jCj  ,  Xj  \  . . . ,  :  j  e  [«]}  be  an  i.i.d.  sample  comprised 

of  n  independent  copies  of  (x  \ ,  X2,  . . . ,  X().  We  may  form  estimates,  P  and  T,  of  P 
and  T,  respectively,  using  empirical  averages  with  respect  to  S.  For  instance,  in  LDA, 
we  may  we  let 


p  ■.=  -  yy, 

1 1  *  ^ 


O') 


7=1 


1 


n*—\  -  ao  +  1 

7  =  1 


PairSo,0  :=  -  <8>  x\ 


p  <g>  p, 
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Triples^  :=  - 


Aj) 


7=1 

a0 


a0 


(7) 


2“0 


-(7) 


(«o  +  2)(ao  +  1) 


.(7) 


fi  +  x 


(7) 

1 


)  ji  ( 


„0') 


T  * 


-(7) 


/r  <g>  fi  <8>  /t, 


-3“)) 


and  set  P  :=  Pairsff0  and  T  :=  Triplesff0.  Note  that  we  may  also  ensure  that  P  and  P 
are  symmetric  (so  [P],j,/  =  [T]Uj  =  [T]jjj  etc.)-  This  gives  consistent  estimates 
of  P  and  T . 

Using  P  and  T .  we  propose  a  plug-in  approach  to  estimation  for  O,  given  as 
Algorithm  2  (Lines  5-13  make  up  the  robust  tensor  power  method  of  [5]).  This  is 
essentially  a  robust  version  of  the  iteration  given  by  (4)  which  finds  an  approximate 
orthogonal  tensor  decomposition  of  T  given  the  nearby  estimate  T .  The  algorithm 

from  [5]  is  applied  to  the  tensor  T(W ,  IT .  IT),  and  the  outputs  v,  and  A,-  should  be 

3/2 

interpreted  as  estimates  of  the  y/CjpOj  and  Cj^/c.  2  for  some  j  e  [fc]. 


4.3  Analysis  of  Algorithm  2 

In  this  section,  we  give  a  simple  error  analysis  for  Algorithm  2.  Let  04  >  02  > 

3/2 

■■■  a i  >  0  be  the  non-zero  singular  values  of  P .  Also,  let  A;  :=  c/^/c-  2  ,  ordered  so 
that  A.i  >  A.2  >  •  ■  ■  >  A*  >  0.  We  define  the  operator  norm  of  a  symmetric  third-order 
tensor  E  e  Hr*"'*"1  by 


E lb  :=  max  |£(i>,  i>,  v)|. 
Wl2=l 


Defineep  :=  ||  P  —  P  Wi/ctk  and  ep  '■=  ||P  —  P||  2.  The  error  bound  is  given  in  terms 
of  ep  and  ej  ■ 

Theorem  3  (Error  analysis  of  Algorithm  2)  There  exist  universal  constants 
C,  C' ,  c,  c'  >  0  such  that  the  following  holds.  Pick  any  S  e  (0,  1).  If 


€p  <  c 


kkfk\ 


€t  <  C  ■ 


Xko, 


3/2 


N 


>  C^logW  +  Iog(log(^-  +  ^))) 


L  >  poly(£)  log(l/<5) 


(for  some  fixed  polynomial  poly(A:)  specified  in  Theorem  5.1  of  [5]),  then  with  proba¬ 
bility  at  least  1  — 5,  Algorithm  2  returns  {(£,- ,  Xj)  :  i  6  [k]}  satisfying,  after  appropriate 
reordering, 
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Algorithm  2  Plug-in  estimator  based  on  orthogonal  tensor  decomposition 


input  positive  integer  JeN,  symmetric  matrix  P  6  x d ,  SymmetI-jc  tensor  T  e  Rc,xrfxrf,  number  of 
iterations  L  and  N. 
output  {(i; ,  ij)  :  i  e  [4]}. 

1:  Compute  eigendecomposition  of  P,  let  >  Tji  >  •••  >  %  be  the  top  k  eigenvalues,  and  let 
£  1 ,  f  2  >  ■■  ■  •  (k  -  be  the  corresponding  orthonormal  eigenvectors. 

(Halt  and  signal  failure  if  %  <  0.) 


2:  Set  W  :=  [f  ilf2|  ■  ■  ■  1/Jn, 1/Jrjf). 

(VnVT  is  the  pseudoinverse  of  a  rank-1:  approximation  to  P.) 
3:  Initialize  T  :=  T . 

4:  for  i  =  1  to  k  do 
5:  for  r  =  1  to  L  do 

6:  Draw  0^  uniformly  at  random  from  the  unit  sphere  in  R*. 

7:  for  t  =  1  to  JV  do 

8:  Compute  power  iteration  update: 


u  :=  T (W,  Wep\ ,  0,(r) 


U 

Wii' 


(*) 


9:  end  for 

10:  end  for 
11:  Let 

r*  :=  arg  max  T(W0\ iP ,  WO1)? ,  Woff) 
ze[L ]  v 

(r*)  •  ~ 

and  execute  N  more  power  iteration  updates  (★)  starting  from  0  N  to  obtain  0; . 
12:  Let 


Vj  :=  (WT)td;,  ij  :=  T(W0j,  W0,,  W0p. 

13:  Deflate  T\  set  T  :=  T  —  A;  i;  <2>  »,•  (8>  i,- . 

14:  end  for 


,  /  1  1  /Xi 

Wvi-OihSC'  ■[- - j '€r+  71 

\Xi  \Xj 

l^i  —  A,,- 1  <  C'  ■  ^  2/2  ‘  €t  + 


for  all  i  e  [A:]. 

Proof  We  assume  without  loss  of  generality  that  =  1  for  all  i  e  [&],  so  P  —  O  O, 

and  a;  =  C;  3  for  all  i  e  [Ar] .  By  Lemma  8  from  [28],  if  II  is  the  orthogonal  projection 
to  span{£j,  £2,  ■  ■  ■ .  £&}  (w.r.t.  the  standard  inner  product),  and  77  is  the  orthogonal 
projection  to  range) O)  (again  w.r.t.  the  standard  inner  product),  then 

||(/d-  77)77||2<  l.5eP.  (5) 


Note  that  (WT)f  =  [^l^l  •  •  •  |£*]diag(</77i,  y/m,  ■  •  •  >  and  therefore  77  = 

(WT)t#T. 
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Let  W  :=  By  Lemma  9  and  Lemma  11  from  [28], 


1 

||W||2  <  ; 

VTT  -  eP)ak 

||(W  -  W)tO||2  <  y/l  +  l.5ePl.5eP , 

T(W,  W ,  W)  -  T(W,  W ,  W)||2  <  '  6T  +  611  '  €p=:s- 

ak 


(6) 

(7) 

(8) 


Note  that 


k 

T(W ,  W,  W)  =  ^L/(Wto,)  <g>  (WTOi)  ®  (WTo,-) 
1=1 


where  {WT0/  :  i  e  [k]}  are  orthonormal.  Using  (8),  Theorem  5.1  from  [5] 
implies  the  following:  if  s  <  C\Xk/k,N  >  C2(log(k)  +  loglog(Ai/e)),  and 
L  >  poly(k)  log(  1  /<5),  then  with  probability  at  least  1  —  5,  the  robust  tensor  power 
method  returns  Oj  and  satisfying  (after  appropriate  reordering) 

||0;  -  Wto,-||2  <  \ii-Xi\<5s.  (9) 

A; 


Therefore, 


I  Vi  ~  Oj  ||  2 


((# )r0t  -  (#T)tWToi) 


+  ((#  )fWTOj  —  (W  )tW  Oj)  +  ((W  A W  Oj  -  Oj) 

<  |(#T)t0,-  -  (wT AwT Oj  j| | 

+  |(WT)tWTo/  -  (#T)t#T0/|j,  +  |(WT)t#T0/  -o;||2 


<  uwtA\\2 


(||0;  -  WTo 


«I|2+II(W-  W)1  0\ 


C n  - 1 d)rioj\ 


<  ,  ,  1  ( ^  +  yr+L5^lW)  +  1.5ep 

V(!  -  *p)ak  \M  ) 


using  (5),  (6),  (9),  and  (7).  □ 

We  remark  that  Theorem  3  immediately  implies  estimation  consistency  under 
appropriate  assumptions  on  the  ay  and  A.;,  and  it  is  straightforward  to  obtain  finite 
sample  guarantees  using  concentration  arguments  to  bound  ||  P  —  P\\2  and  \\T  —  T\\j. 
We  leave  this  as  an  exercise  for  the  reader. 
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